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ly-v . Abstract 

In response to an increasing availability of statistically rich observational data 
sets, the performance and applicability of traditional Atmospheric Cherenkov 
Telescope analyses in the regime of systematically dominated measurement un- 

_^ , certainties is examined. In particular, the effect of systematic uncertainties 

f->i ■ affecting the relative normalisation of fiducial ON and OFF-source sampling 

regions - often denoted as a - is investigated using combined source analysis as 
a representative example case. The traditional summation of accumulated ON 
and OFF-source event counts is found to perform sub-optimally in the studied 

^ ' contexts and requires careful calibration to correct for unexpected and poten- 

tially misleading statistical behaviour. More specifically, failure to recognise 
and correct for erroneous estimates of a is found to produce substantial overes- 

^ ] timates of the combined population significance which worsen with increasing 

CO ' target multiplicity. An alternative joint likelihood technique is introduced, which 

^ I is designed to treat systematic uncertainties in a uniform and statistically ro- 

bust manner. This alternate method is shown to yield dramatically enhanced 
performance and reliability with respect to the more traditional approach. 
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1. Introduction 

Recent developments involving Imaging Atmospheric Cherenkov Telescopes 
(lACTs) have revolutionised the field of Very High Energy (VHE) 7-ray as- 
tronomy. Indeed, observational data obtained using the current generation of 
stereoscopic arrays have contributed over 100 new sources to the catalogue of 
confirmed GeV-TeV emitterso. 
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Notwithstanding these impressive achievements, the motivation for astronomers 
to continue identifying and categorising additional 7-ray sources and source 
populations remains undiminished. The initial source detections are likely to 
be dominated by luminous or nearby 7-ray emitters, implying a requirement 
for subsequent observations to target progressively fainter source populations. 
Accordingly, the development of analytical techniques which enable maximal 
exploitation of the available instrumental sensitivity will become increasingly 
important as the life-cycles of the various lA CT arrays unfo ld. This emerging 



philosophy is exemplified by the recent work of lKlepseiJ (|2011l ) which employs ac- 
curate knowledge of the instrumental response to enhance the performance and 
rel iability of an esta blished technique for source detection originally developed 
bv lLiandM"al(|l983l) . 



Until recently, the inherent faintness of astrophysical VHE 7-ray fluxes has 
resulted in statistically limited signal measurements with uncertainties that are 
dominated by Poissonian shot noise. Accordingly, many traditional lACT anal- 
yses have justifiably neglected the existence of previously sub-dominant system- 
atic effects. In contrast, successful source detection at the limits of instrumental 
sensitivity requires deep observations and the accumulation of rich data sets with 
abundant event statistics. In this new regime, the relative contribution of sta- 
tistical fluctuations to the overall error budget is suppressed with measurement 
accuracy ultimately becoming systematically limited. 

This work investigates methods for the appropriate statistical treatment of 
systematic uncertainties affecting lACT observational data and their subsequent 
analysis. In Cherenkov astronomy, the synthesis of large data sets typically in- 
volves aggregation of temporally sparse event statistics which correspond to 
distinct values of various observational parameters, each of which can intro- 
duce distinct, observation-specific systematic biases. Accordingly, analyses that 
are designed to ameliorate the impact of irreducible systematic effects must 
incorporate sufficient flexibility to successfully model this inter-observational 
variability. 

In the subsequent discussion, the relevant statistical issues and the tech- 
niques designed to address them are illustrated using the specific example of 
stacking analysis. In such analyses, measurement data that correspond to mul- 
tiple, independent but similar experiments are combined to enhance their scien- 
tific utility. In the context of 7-ray astronomy, stacking analyses are commonly 
applied in situations when a target population comprises a large number of 
individually undetectable source candidates that are theoretically expected to 
exhibit at least one identical signal characteristic. In the absence of genuine 
7-ray emission, the superimposed observational data will be consistent with 
random Poisson fluctuations about the mean background level. Conversely, 
multiple faint signals produced by a genuine sample of faint 7-ray sources will 
reinforce upon combination, yielding a significant overall excess with respect 
to the measured background. Accordingly, stacking analyses often facilitate the 
derivation of average source properties for the target population as a whole, even 
when detection of individual 7-ray signals is rendered impossible by inadequate 
instrumental sensitivity. The specific consideration of stacking is instructive 



because combined source analysis inherently segregates subsets of observational 
data which are likely to incorporate distinct systematic biases. Moreover, stack- 
ing analyses provide a more intuitive distinction between systematic effects that 
undergo temporal variation, such as telescope elevation or optical efficiency, and 
those that vary on a target-wise basis, such as the ambient night-sky brightness 
of the number of 7-ray sources within the telescope field-of-view. 

Compelling and theoretically well motivated targ et populations fo r com- 
bined source analyses include X-ray binary syste ms (Dickinson . 2009 ). com- 
bined pulsar populations within globular clusters (JAharonian et all . l2009l ) and 
the self-annihilating dark matter com ponents of dwarf spheroidal galaxies (e.g. 



H. E. S. S. Collaboration et al.l . l201lD . 



1.1. Relevant Aspects of the Observational Technique 

The atmosphere of the Earth is almost completely opaque to multi-GeV 
and TeV radiation and accordingly, the operational mode of lACT arrays relies 
upon indirect detection of astrophysical 7-rays. Indeed, the majority of incident 
VHE photons undergo electron-positron pair production interactions within the 
Coulomb fields of stratospheric nuclei. Frequently, these interactions initiate 
electromagnetic cascades of charged particles, which radiate Cherenkov light as 
they descend superluminally through the dielectric medium of the atmosphere. 
lACT arrays are designed to image the resultant distribution of Cherenkov pho- 
tons at ground-level, using the encoded information to reconstruct the proper- 
ties of the progenitor 7-ray. The set of reconstructed 7-ray properties define 
a multi-dimensional parameter space, with suitable sub-spaces that define the 
event populations that are required for specific astrophysical analyses. 

The situation is complicated somewhat by the fact that lACT arrays operate 
in a strongly background dominated regime and must successfully distinguish 
a comparatively weak 7-ray signal from an almost overwhelming background 
of air-showers initiated by hadronic cosmic rays. Fortunately, the properties 
of hadron- and 7-ray-initiated air-showers are sufficiently disparate that an 
effective segregation is possible based on the characteristics of the captured 
Cherenkov images. Traditionally, a simplified elliptical parameterisation of the 
Cherenkov light distribution was used in conjunction with lookup tables of 
simulated air -shower properties to separate the signal and backgro und event 
populations (jCawlev et all . Il985t iHillad . Il985t IWeekes et aP . Il989h . Modern 



com putational resources permit utilisation of the comple te Cherenkov images 



(e.g iLe Bohec et al.l . Il998t Ide Naurois and Rol 



andl. 200911 and the application 



Fiasson et al.l . I2OIOI ) to further 



of advanced machine-learning algorithms (e.g 
enhance the background rejection efficiency. 

Despite these sophisticated filtering techniques, the sheer number of air- 
showers that trigger an lACT array during a given observation inevitably results 
in a subset of hadron-originated Cherenkov images which are sufficiently 7-ray- 
like in appearance to survive the event selection procedure. Quantitatively, the 
probability that a 7-ray-like event having specific reconstructed parameters will 
pass background rejection is described by a multi-dimensional acceptance func- 
tion. The reconstructed 7-ray signal (Nqn) within a suitably defined sampling 



region of the event property parameter space is a superposition of the true 7-ray 
signal (iVs) on an irreducible background (A'b) of spuriously classified events. 

A^ON =Ns+Nb (1) 

The magnitude of the background component is typically estimated using the 
number of 7-ray-like events {Nqpf) that are reconstructed within one or more 
nominally OFF-source regions. Adopting this convention, ([Ij is normally re- 
expressed as 

A^ON ^Ns+ aNoFF (2) 

where a is a normalisation factor which compensates for any disp arities in the 



instru mental response among the various sampling regions (see e.g. lBerge et al. 
20071 for a more detailed discussion). 



Fundamentally, the precise value of a is dependent upon specific charac- 
teristics of the individual observations that contribute to the overall data set. 
Numerous variable factors such as the target zenith and azimuthal angles, the 
configuration and functionality of individual telescopes within a larger array, 
or the ambient atmospheric conditions may modify the nominal system perfor- 
mance. 

The functional dependency of a on the various observational and instru- 
mental parameters is complicated and is must typically be resolved for each 
observation using a mult i -dime nsional model estimate of the prevailing system 



acceptance (|Berge et all . 120071 ). Candidate models may be synthesised by in- 
terpolating the distributions of representative 7-ray-like background data sets, 
assembled from multiple independent observations of empty fields-of-view. Ac- 
curate replication of the true system acceptance using this technique requires an 
extensive database of suitable observations with characteristics that sample the 
entire space of observational parameters with adequate resolution. For certain 
well-understood parameters it may be possible to derive semi-analytic parame- 
terisations of the corresponding systematic distortions, which can improve the 
interpolation accuracy and reduce the required sampling density. Nonetheless, 
for complicated instruments with a large number of possible configurations, the 
observation time required to populate the reference database may be prohibitive. 
Alternatively, response models which inherently incorporate the appropriate sys- 
tematic ofi'sets can be constructed on an observation- wise basis using the subset 
of 7-ray-like background events that fall outside of the fiducial sampling regions. 
However, this approach is susceptible to contamination by imperfectly excluded 
7-ray sources and is ultimately limited by the availability of event statistics 
within a single field-of-view. 

For the current generation of lACTs, residual discrepancies between the 
generated model and the true instrumental response reflect a superposition of 
imperfectly modelle d systematic offsets and are typically at the few percent level 



(jBerge et al.l . 120071 ). However, if observational situations arise in which both 
approaches for modelling the system acceptance are compromised, the resultant 
disparity could be much larger. For example, if the telescope configuration or the 
prevailing operating conditions restrict the availability of appropriate archival 



background data and the target field-of-view is crowded, then systematic effects 
could dominate statistical uncertainties when searching for faint 7-ray sources. 
Moreover, since the combined impact of multiple systematic effects is often 
reflected by non-uniform variations in the system acceptance, the relative extent 
and location of the sampling regions within the event property parameter space 
may also bias the estimated value of a. Most experiments tacitly acknowledge 
the uncertainty in a by attaching a conservative systematic error term to their 
quoted results. 

1.2. Alternative Stacking Analyses 

Conventional stacking analyses typically entail simple summation of the ob- 
served event counts associated with the individual targets in the population 
sample, with the aggregate datasets forming the basis for subsequent astro- 
physical inference. However, the background estimation procedures outlined in 
ij 1.1 1 render this traditional approach (hereafter referred to as data stacking) sub- 
optimal in the context of VHE 7-ray observations. In particular, data stacking 
analyses are complicated somewhat by the requirement to synthesise a com- 
bined value of the normalisation parameter a which corresponds to the stacked 
dataset. A viable approach uses the definition of the overall signal excess, 

A = iVoN - aA^oFF 

= y^ A^ON.i - « y^ A^OFF.M i = l,...,m 

i i 

= ^ AoN.j - ^ajAoFF,i, i = l,...,m (3) 

i i 

to form an average over all m targets in the sample, weighted by the individual 
off-source counts. 



2J a^AoFF,; 



Y^ NoYF, 



z = l,...,m (4) 



Unfortunately, this technique fails to account for any target-specific uncertain- 
ties that arise during derivation of the individual a^. Even if no systematic 
biases are introduced, the diverse observational parameters that correspond 
to each individual target dataset will likely affect accuracy to which the cor- 
responding values of a^ can be determined (JBerge et al.l . 12007 ). The resultant 



non- uniformity renders subsequent propagation of the individual error estimates 
to the value of a somewhat problematic for the data stacking approach. Indeed, 
it is often unclear how the individual uncertainties should be combined in a sta- 
tistically consistent manner, particularly when they are characterised by asym- 
metric intervals or more complicated probability distributions. Assignment of 
a uniform but conservative systematic error term to all of the a^ renders the 
problem tractable, but inevitably results in an overall loss of sensitivity. 

Moreover, indications exist that the shortcomings of the d ata stacking tech- 
nique can be problematic in experimental situations. Indeed. iDickinsonI ( 20091 ) 



derived anomalously high combined significance results when applying a data 
stacking procedure to two independent target ensembles comprising 37 and 44 
putative VHE 7-ray sources for which A^oFF,i ~ 100 — 500 a nd aj ^ 0. 7. B; 



studying the ON and OFF-source event count distributions, iDickinsonI ()200' 
concluded that a relative target-wise dispersion Aai/a^ < 0.1 was required to 
explain the derived significances in the absence of a true 7-ray signal. 

In subsequent sections, the statistical behaviour of the data stacking ap- 
proach is investigated, highlighting the limited applicability of this technique in 
scenarios for which a is uncertain. In response to these inherent shortcomings, 
a statistically robust method for the combination of lACT observations is out- 
lined, which allows systematic uncertainties to be treated on a target-by-target 
basis. Using this technique, observations yielding weakly constrained values 
of ai may be usefully and consistently combined with more reliable datasets, 
without diluting the scientific utility of the latter. 

2. Derivation of the Joint Likelihood 

The new method operates by defining target-specific likelihood functions 
which facilitate appropriate treatment of relevant systematic uncertainties. Stack- 
ing is implemented by forming a product of individual target likelihood func- 
tions and using this combined likelihood to estimate the shared characteristics 
of the putative 7-ray signals. It should be emphasised that the applicability 
of the method is contingent upon an implicit assumption of at least one iden- 
tical signal characteristic for all targets comprising the sample. Realistically, 
signal characteristics for which the required assumption of universality is rea- 
sonable are typically target-specific functions of the experimental observables. 
As discussed in the subsequent paragraphs, disparities between putative sub- 
populations comprising the target sample are inherently problematic in the con- 
text of data stacking. In contrast, the suggested approach intrinsically facilitates 
straightforward customisation of the individual target likelihoods. 

Construction of the single target likelihood function proceeds under the as- 
sumption that the sample of 7-ray-like events that are reconstructed within a 
nominal sampling region is drawn from a parent population with Poisson dis- 
tributed arrival times. Accordingly, if the value of a were precisely constrained, 
then (m would imply 

TVoN - Pois(iVoN, ^s + aNoFF) and iVoFF - Pois(7VoFF, ^off) (5) 

where A^on and A'^qff denotes the true mean value of the Poisson distributed 
variables A'^qn and A'qff respectively. The new parameter a represents the 
unknown, true value of the derived normalisation parameter. In fact, the as- 
sumed validity of ([S]) is implicit in the derivation of a in Q with likely devia- 
tions from this idealised behaviour undermining the reliability of the traditional 
data stacking framework. Conversely, the combined likelihood approach treats 
target-specific systematic uncertainties by modelling a as a random variable 
with arbitrary distribution G(a|a). As indicated in §1.11 the specific functional 



definitions of G{a\a) are dependent upon observational parameters which are 
hkely to vary on a target-by-target basis. 

The probabiUty density function which models an abstract single-target 
dataset D = {A'^oN,-^OFF,ct} is simply the product of the distributions of its 
component data. 

fi{No^,NoYY,a\Ns,NoYF,a) = Pois(iVoN,^s + uNofy) 

•P0is(AfoFF,iVoFF) (6) 

• G{a\a) 

Accordingly, substitution for D using a concrete observational dataset d = 
{noN, "-OFF, otohs} enables calculation of the target-specific likelihood describing 
the true 7-ray signal. 

Li(iVs, A^OFF,a|noN,'^OFF,aobs) = /i("-on,"-off, aobslA'^s, ^off, a) (7) 

Modelling systematically uncertain parameters as random variables with known 
probability density functions is a common statistical approach , with applications 



exten ding beyond the limited domain of stacking analyses fe.g. lHeinrich and Lyons 
I2OO71 ). Indeed, equations dU and ([7]) are equally valid in single target lACT 
analyses that must accommodate non-negligible systematic uncertainties on the 
estimated value of a. 

The ultimate goal of this stacking procedure is to obtain improved con- 
straints on the global 7-ray signal characteristics. In this context, A'^qff and a 
are not of primary interest and are therefore categorised as nuisance parameters 
throughout the subsequent analysis . 

Finally, the joint likelihood describing the global signal characteristics of 
the entire ensemble of m targets is simply the product of the individual target 
likelihoods. 

m 

L}.ni{Ns) = JJil,i(^S,^OFF,i,fii|'^ON,i,'^OFF,i,aobs,i) (8) 

1=1 

Accordingly, the set of values which parameterise Lj is the union of the param- 
eters of the component single-target likelihood functions and therefore incorpo- 
rates 2m uncertain nuisance parameters. 

3. Inference of the Signal Characteristics 

3.1. Detection Significance 

The initial objective of a stacking analysis is typically the conclusive detec- 
tion of a combined 7-ray excess, with detailed investigation of specific signal 
characteristics assuming subsidiary importance. In such situations, the proba- 
bility of detection is typically evaluated in terms of the the significance, S with 
which experimental data exclude the null hypothesis of zero signal. 



For an abstract likelihood function L{tz,9\D) relating the experimental data 
D = {Di...Dn} to multiple nuisance parameters, 6 = {9i, . . . ,9j} and pa- 
rameters of interest tt = Jtti, . . . , tt;}, a popular formula for calculation of de- 



tection significance (iLi and Mai 119831 ) employs the likelihood ratio defined by 
( Nevman and Pearsonl 119331 ) 



Alm — 



L{7Vo,eo\D) 
Lin,9\D) 



(9) 



where a caret denotes the maximum likelihood estimate of the accented symbol, 
ttq denotes the value of tv which corresponds to the null hypothesis, and Oq 
represents the conditional maximum likelihood estimate for 9 given that tt = ttq. 
The significance is subsequently defined in terms of Alm 



S= V-21nA 



LM 



(10) 



which reduces to 
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as reported by (jLi and Mai . 1 1 9831 Equation 17) and is nominally xi distributed 
for a typical, single-target dataset with negligible a uncertainty i.e. D = 
{-^ON, -^OFF, a}. Proponents of data stacking typically emulate the single- 
target approach by using Equation 17 in conjunction with a pseudo-dataset 
D' = IX^i -^ON,i, X^i -^OFF.ijCk}. Consequently, non-uniform observational de- 
pendencies the individual a^ become entangled, which inevitably complicates 
statistically robust treatment of their effects. Conversely, direct substitution of 
Lj in the calculation of Ala/ in the joint-likelihood approach ensures that the re- 
sultant expression for S is constructed using all available information regarding 
target-specific system atic u ncertainties. 

Recently. Klepser (j201l[ ) proposed an extension to the method suggested by 
Li and Mai ( 19831 ) which uses well-calibrated estimates of the instrumental point- 
spread function to achieve enhanced sensitivity relative to the more traditional 
analysis. Their technique ameliorates the effect of systematic variations in the 
telescope acceptance characteristics by segregating the observational data on 
the basis of prevailing instrumental and environmental parameters. 

3.2. Confidence Intervals 

Results that are derived from a joint likelihood stacking analysis comprise es- 
timates of the various experimental parameters of interest, with any uncertainty 
quantified using associated confidence intervals. The frequentist definition of a 
properly constructed 100(1 — e)% confidence interval states that 100(1 — e)% 
of such intervals, generated from a large number of independent experimen- 
tal measurements of a quantity X, will contain (or cover) the true value X. 
Confidence intervals which fulfil this criterion are described as having correct 



coverage. Construction of confidence intervals typically uses experimental data 
to define a test statistic which is subsequently used to identify regions of the 
relevant parameter space that fulfil the required statistical criteria. The profile 
likelihood is an appropriate choice of test statistic in situations that require the 
treatment of multiple nuisance parameters. The profile likelihood is defined as 
the ratio 

Ap.(.|i.) = ^(-'^-'^) (12) 

where previously encountered symbols retain their earlier definitions and 0^ 
represents the conditional maximum likelihood estimate for assuming a specific 
value of TT. Conveniently, the distribution of — 21ogApL converges to that of 
a x^ rando m variate with I degre es of freedom for large experimental data 
sets (e.g. iCasella and Bergeill2002l ). Accordingly, straightforward derivation of 



confidence intervals is facilitated by comparison of the derived test statistic with 
appropriate percentiles of the relevant y^ distribution. Moreover, the profile 
likelihood has the desirable property of being independent of 0, facilitating 
the derivation of robust confidence intervals, even in the presence of uncertain 
nuisance parameters. 

4. Monte Carlo Studies 

Verification of the performance and reliability joint likelihood stacking pro- 
cedure emp loyed a computerised toy M onte Carlo approach. The RooFicI 
framework ( Verkerke and Kirkbvl 120061) was used to generate multiple simu- 



lated datasets consistent with the assumed distributions of A^on, A^off and a 
for specific values of the true signal parameters iVs S [0, 10], iVoFF = 100 and 
a = 0.1. To better understand the efi^ect of the distribution of a on the out- 
come of each stacking procedure, data were generated using three alternative 
parameterisations for G{a\ai). A basic test case, simulating unbiassed system- 
atic uncertainties, employed a Gaussian parameterisation G{a\a) — Gaus(Qf, <Ta) 
with ctq, = 0.02 (hereafter referred to as Model A). Two less idealised examples, 
adopting bifurcated Gaussian functions 

rif \~\ / Gaus(a, (Tq^l) if a < a , , 

G(a\a) ~ < r^ )~ \ -r ~ (13) 

^ ' ' \^ Gaus(a, ctq^r) it a > a ^ ' 

with (crQ,L,crQ,R) — (0.01,0.03) and (ctq^l, ctq.r) ~ (0.02,0.08) (hereafter re- 
ferred to as Model B and Model C, respectively), allowed the effect of increasing 
asymmetry in the modelled a distribution to be investigated. Figure [1] shows 
representative distributions of a corresponding to each of the simulated scenar- 
ios. Models A and B mimic somewhat pessimistic observational situations in 
which the choice of OFF-source sampling regions is restricted, limiting event 
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Figure 1: Generated Monte Carlo distributions of the ON/OFF normalisation a corresponding 
to Model A (black), Model B (red), and Model C (blue). 



statistics and biasing the derivation of a. Accordingly, the corresponding re- 
sults are particularly relevant to galactic source populations, for which source 
confusion and diffuse background contamination within target fields-of-view can 
be particularly problematic. Model C represents an extreme case which is un- 
likely to occur in real experimental contexts, but is nonetheless included to 
demonstrate the efficacy of the joint likelihood technique. 

Using RooFit, ensembles of the generated single-target datasets were then 
employed in conjunction with the relevant compound probability density func- 
tions to obtain multiple realisations of the joint likelihood Lj,m corresponding 
to a discrete range of target multiplicities me [1, 10]. For each ensemble dataset 
Dm, the statistical significance (iSj) of th e combined 7-ray signal was evaluated 
using the likelihood ratio prescription of ( Li and Mal . ll983[) 



Sj = -2los 



Lj,m{Ns=0,eo\Dra) 



Ot = {A^OFF,^, aO, i = 1 . . . m. (14) 



For comparison, a traditional data stacking analysis was also applied to the 
generated data. The components of the ensemble datasets used to construct 
each realisation of Lj_m were combined to obtain corresponding stacked datasets, 
^'m = {Si-'^ON,i)X]i-^OFF,zia}- Subsequent statistical inference employed a 
likelihood function I/dS) which was formed by substitution of the stacked data 
into the single target probability density function /i with G defined as a delta 
function centred on the stacked a estimate, G(a|Qf, a) = 5{a — a). Following the 
convention outlined in fj3l Equation (jlip was evaluated using the components 



10 



of D!^ do derive the statistical significance ^ds of the combined 7-ray signal 
corresponding to the data stacking approach. 

For both stacking methodologies, construction of the associated profile like- 
lihoods was accomplished using the intrinsic capabilities of RooFit, adopting 
A'^g as the sole parameter of interest. Confidence intervals were then defined as 
the ranges of iVg outside which —2 log Apl exceeded an appropriate percentile 
of the nominal Xi distribution. 

5. Comparison of the Alternative Stacking Techniques 

For each distinct combination of the true signal parameters, modelled dis- 
tribution of a, and target multiplicity, the Monte Carlo approach outlined in 21 
was applied to derive 5000 independent estimates of Ns and S corresponding 
to each stacking technique. In combination with appropriate statistical metrics, 
the resultant data enable a comparative evaluation of the joint likelihood and 
traditional data stacking approaches. 

5.1. Detection Significance 

An experimentally relevant diagnostic for both stacking analyses is the prob- 
ability with which data corresponding to a particular value of iVg will yield a 
significance S, in excess of a specified threshold St. Assuming data that are 
distributed in accordance with the null hypothesis, this probability is called the 
significance level. Although the specific choice of St is arbitrary, it is typically 
chosen in order to obtain a desired significance level and ensure a controlled 
rate of false-positive detections. Objective comparison of the alternative stack- 
ing techniques is facilitated by a threshold which corresponds to a nominal per- 
centile of the appropriate null distribution of S. A related statistic is the power, 
which describes the probability that derived values oi S > St will correctly 
identify a genuine signal, and provides useful insights regarding the sensitivity 
of the corresponding analytical method. 

The relevant Monte Carlo distributions for S, corresponding to Ng — 0, 
are plotted for each target multiplicity in Figure [2J Assuming that the true 
value of a is 0.1, systematic uncertainties which affect the estimation of ai ac- 
cording to the distributions plotted in Figure [1] will generate misleading results 
in a data stacking analysis. In many experimental situations, a small number 
of erroneous results is tolerable, as long as the expected frequency with which 
they occur can be reliably predicted. This predictability criterion is realised 
if the derived values consistently conform to a known statistical distribution. 
Irrespective of G{a\a), and for all m S [0,10], the significance estimates de- 
rived using the joint likelihood technique appear to retain the xi distribution 
which arises in case of zero a uncertainty. In contrast, results that correspond 
to traditional data stacking exhibit substantial deviations from this nominal 
distribution. Moreover, adoption of asymmetric a distributions introduces a 
monotonic dependence on the target multiplicity, yielding systematically larger 
significance estimates as m increases. 
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Consideration of the procedure used during data stacking to synthesise a 
combined a estimate suggests an intuitive explanation of the observed be- 
haviour. Models B and C generate ensemble data sets in which the subset of 
overestimated a values is typically predominant and has an absolute mean offset 
from a which exceeds that of the remaining, underestimated a^. Accordingly, 
the frequency with which Equation (JH) yields a substantial overestimate for a 
should increase with the ensemble size, which is consistent with the observed 
behaviour. 

Conversely, symmetrically distributed a uncertainties should not inherently 
bias the data stacking process. Instead, the target-specific weighting of each a^ 
specified in Equation ^ prevents perfect cancellation of the systematic offsets 
and inhibits convergence to the nominal null distribution for S" as m — > oo. As 
outlined in 21 all Monte Carlo data sets used for this study assume a common 
A^OFF = 100, with the range of inter-target variability in A^off.i restricted to 
the level of Poisson fluctuations. In experimental contexts, the true number of 
OFF events is likely to be different for each target and the weights accorded to 
the corresponding ai would be more diverse. The adverse impact of symmetric 
systematic uncertainties would then depend critically on the joint, target-wise 
distribution of a-offset and A^oFF,i, with specific ensembles potentially realising 
particularly benign or pathological scenarios. 

As an alternative to the absolute measure of significance defined by Equation 
(jlip . experimental practitioners often choose to multiply S by the sign of the 
corresponding 7-ray excess. Adopting this convention, systematic overestimates 
of a during data stacking would tend to over-generate negative significances 
when A^s = 0. In many situations the notion of a negative signal is not physically 
meaningful and such results would probably be disregarded. In general, the 
ability to eliminate spurious results on the basis of their physical validity may 
ameliorate the impact of particular systematic effects, and reduce the rate of 
false-positive detections. Nonetheless, unfeasibly negative significances should 
raise concerns regarding the trustworthiness of corresponding upper limits. 

Significance thresholds corresponding to the 95th percentile of the various 
null distributions of S are plotted in the left-hand column of Figure [31 Re- 
sults derived using the joint-likelihood approach exhibit minimal multiplicity 
dependence and appear consistent with the nominal St = 1.96 expected for a xi 
random variate. Conversely, thresholds which correspond to the data stacking 
approach systematically exceed their joint likelihood counterparts, and become 
increasing discrepant for larger m in simulations that assume asymmetric dis- 
tributions of a. Accordingly, recourse to a nominal distribution of S is not 
appropriate in the data stacking framework, and valid interpretation of the 
significance of a particular detection requires empirical calibration of distinct 
significance thresholds for each target multiplicity. The values of St plotted 
in Figure [3] permit derivation of appropriately calibrated estimates of the sta- 
tistical power at each vertex of the investigated model parameter space. The 
resultant probabilities are presented in Figure |4l while the results plotted in 
the right-hand column of Figure [3] reveal the corresponding differences power 
between the alternative stacking techniques. 



12 



Results pertaining to Models A and B exhibit saturation at extreme values of 
Ns and m, caused by the limited statistics of the Monte Carlo datasets. Results 
corresponding a symmetric Gaussian distribution for a reveal derived powers 
which are compatible to < 2% for all m £ [1, 10] and iVs G [0, 10]. Although 
traditional data stacking appears to offer marginally superior performance for 
this scenario, asymmetric parameterisations for the distribution of a reverse 
this outcome. Indeed, powers derived using the joint likelihood approach in 
conjunction with Models B and C exhibit substantial enhancement with respect 
to their data stacking counterparts. Moreover, powers corresponding to the data 
stacking technique exhibit a marked sensitivity to the degree of asymmetry in 
the adopted a distribution, which appears to be substantially ameliorated by 
the joint likelihood approach. Strong suppression of the resultant data stacking 
power is evident when Model C is applied to the Monte Carlo analysis, with 
derived discrepancies ~ 0.9 separating the alternative stacking techniques at 
the upper extremes of the modelled parameter ranges. More significantly for 
practical applications, assuming the intermediate scenario represented by Model 
B, reveals a maximum disparity of ~ 0.4 for the experimentally interesting region 
of Ns ~ 5 as TO — )• 10. In this context, the joint likelihood approach yields a 
two-fold enhancement in the likelihood of detecting a genuine 7-ray signal. 

5.2. Confidence Intervals 

For an analysis to perform reliably in an experimental context, it is impor- 
tant that any generated confidence intervals have correct coverage. In reality, 
despite sophisticated treatment of associated systematic and statistical uncer- 
tainties, residual disparities between the assumed and actual distributions of ex- 
perimental data often prevent practical realisation of this criterion. Moreover, 
deviations from correct coverage often lead to an unexpected increase in the 
frequency of spuriously inferred scientific conclusions. Accordingly, the degree 
by which the derived confidence intervals deviate from their nominal coverage 
provides an informative comparator for the alternative stacking techniques. 

Figure \5\ displays the fraction of derived 95% confidence intervals for A''g 
which cover the true signal value, for each parameterisation of G{a\a). Com- 
plementary insight is provided by Figure [SI which illustrates the correspond- 
ing deviations from correct coverage as a function of iVg and the target multi- 
plicity, m. It is evident that confidence intervals generated by the traditional 
data stacking approach are generally incompatible with their stated coverage. 
While the derived coverage fractions exceed the nominal 95% for low signals 
and target multiplicities, the majority of the {Ns,m) parameter space is char- 
acterised significant negative deviations. This behaviour is typically described 
as under- coverage, and implies excessively permissive confidence intervals with 
an increased propensity to generate false-positive results. The observed under- 
coverage is maximised for small non-zero signals and appears to be exacerbated 
by the addition of targets to the ensemble dataset. Moreover, increased asym- 
metry of the modelled a distribution has a markedly detrimental effect on the 
derived coverage characteristics. Indeed, the maximum divergence from correct 
coverage increases from 4% to 25% and then to 85% for Model A, Model B, 
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and Model C, respectively. In the worst case, this means that only 10% of the 
derived intervals include the true parameter value. The inherent inability of 
the data stacking technique to acknowledge systematic uncertainties leads to 
confidence intervals which are unrealistically narrow. Consequently, reported 
accuracy of experimental measurements will be excessively optimistic and in 
the worst cases could be used to infer spurious scientific conclusions. 

In contrast, notable stability in the coverage fractions derived using the 
joint likelihood technique indicates robust handling of systematic uncertainties. 
Confidence intervals derived under the assumption of a Gaussian distribution in 
a exhibit coverage which is correct to within ^1% throughout the studied region 
of the model parameter space. Variation of the coverage fraction as a function of 
Ns remains minimal when asymmetric parametrisations of G(a|a) are assumed. 
Although under-coverage which worsens with increasing target multiplicity is 
evident in results that correspond to Models B and C, the discrepancy with 
respect to correct coverage does not exceed ^ 5%. 

6. Discussion 

The comparative evaluation presented in ^ reveals several limitations to 
the applicability of traditional data stacking in situations when a is uncertain, 
many of which are addressed by the joint likelihood technique. 

Reliable inference of signal characteristics is impaired by significant levels of 
under-coverage for Ns ^ 2 in the data stacking approach. Moreover, the inher- 
ent permissivity of derived confidence intervals is compounded by the inclusion 
of additional targets. In combination, these characteristics imply a substan- 
tially increased tendency to generate spurious results which is exacerbated as 
the stacked target ensemble is enlarged. In an experimental context, this be- 
haviour would be highly undesirable and largely undermines the motivation 
for combined source analysis. In contrast, confidence intervals provided by the 
joint likelihood approach closely approximate correct coverage under the as- 
sumption of symmetric uncertainties on a, and exhibit multiplicity-dependent 
under-coverage at < 5% levels if an asymmetric distribution is adopted. 

The deviations from correct coverage that are exhibited by the data stacking 
approach are consistent with the erroneous assumption of negligible uncertainty 
in the value of a. Resolution of this issue is complicated by the lack of a straight- 
forward prescription for propagating systematic uncertainties to the calculated 
value of a. Conversely, the joint likelihood method provides a robust technique 
for the treatment of uncertain a estimates on a target-specific basis. 

Appropriate interpretation of the combined signal significance is also more 
complicated within the data stacking framework. The Monte Carlo null distri- 
butions for 5* presented in Figure [5] indicate that application of Equation pTI) 
requires proper calibration of St to prevent spurious rejection of the null hy- 
pothesis. The situation deteriorates when increasingly asymmetric parameteri- 
sations of the generated a distribution are adopted. Under these circumstances, 
naive assumption of a nominal xi distribution for S induces a systematic over- 
estimation of corresponding significance levels which worsens with increasing 
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target multiplicity. In an experimental context, the implied increase in the 
false-positive detection rate for larger target samples is contrary to common 
expectation and therefore represents a serious shortcoming of the data stacking 
technique. In contrast, significance estimates derived for Ns using the joint 
likelihood approach appear insensitive to the value of m and display excellent 
correspondence with the expected xi distribution. 

Notwithstanding appropriate calibration of a suitable detection threshold St-, 
the powers which characterise the data stacking technique indicate that asym- 
metry in the adopted a distribution substantially diminishes the overall sen- 
sitivity. Conversely, the powers corresponding to the joint likelihood analysis 
appear relatively unaffected by the choice of G{a\a). The alternative approaches 
exhibit similar properties, and indeed data stacking appears to marginally out- 
perform the joint likelihood approach, if symmetric systematic uncertainties are 
assumed. 

It should be acknowledged that the degree of adverse behaviour inherent 
in the data stacking technique is contingent upon the adopted values of a and 
iVoFF, as well as the assumed magnitude of the simulated systematic effects. Ac- 
curate modelling of the system acceptance may ameliorate the observed patholo- 
gies, with the cumulative effect of target-specific a uncertainties only becoming 
significant for target ensembles with m 3> 10. Figure [7] illustrates the com- 
bined effect on data stacking of systematic offsets at the few percent level using 
calibrated values of the significance threshold that correspond to the 95th per- 
centile of simulated null distributions for S. The Monte Carlo data sets assume 
relative target-wise uncertainties 0.01 < (Ja.i/ct < 0.1, for both symmetric and 
asymmetric parameterisations for G{a\a). The curves correspond to simulated 
ensembles comprising tti = 50 target sources for which a = 0.1 and A'^off — 100. 
Despite an inevitable reduction in the influence of the modelled systematic off- 
sets, it is clear that even small target-specific biasses can reinforce to distort 
the combined significance for large observational datasets. Moreover, even the 
pessimistic parameter values that define Models A and B may be practically rel- 
evant, especially when the extent of OFF-source regions is restricted by nearby 
sources, or stringent event selection limits the overall count statistics. 

The joint likelihood method demonstrates good performance in a number of 
simple but experimentally relevant scenarios where the reliability of traditional 
data stacking is seriously impaired. Nonetheless, it should be stated that the 
full potential of the joint likelihood method can only be realised if the distri- 
butions of all the systematic uncertainties affect each observational dataset are 
parameterisable with sufficient accuracy. 

Admittedly, definitive derivation of the distribution of a is unlikely to be 
straightforward in many situations. Semi-analytic estimation of the required 
distributions necessitates detailed knowledge of the instrumental response as 
well as careful measurement of the environmental conditions at the time of ob- 
servation. Accordingly, imperfect or incomplete understanding of the dominant 
systematic biases might render this approach untenable. Plausible alternative 
strategies include jackknife or bootstrap resampling of OFF-source data from 
randomly selected background sampling regions to derive multiple estimates 
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for a and construct an empirical distribution. In this case, the accuracy of 
the resultant parameterisation is limited by the finite number of permu t ations 



available for the resampling process. Emulating the approach of lKlepseiJ (|201l[ ) 
and segregating data on the basis of quantifiable operating conditions would 
enable the use of large representative datasets to derive template distributions 
that pertain to finite subsets of the observational parameter space. 

In situations where the suggested techniques are unable to accurately re- 
construct the shape of the a distribution, a generic parameterisation could be 
tailored using target-specific estimates of the dispersion in derived values of 
the ON/OFF normalisation. By acknowledging the existence of the associ- 
ated systematic uncertainties, this first-order approximation would ameliorate 
under-coverage as well as deviations from the nominal xi distribution for S, 
thereby reducing the rate of unexpected false-positive detections. Moreover, by 
facilitating limited adaptation of the modelled systematic uncertainties, this ap- 
proach retains a key advantage of the full joint likelihood technique that cannot 
be straightforwardly emulated by traditional data stacking. Indeed, it is likely 
that this simplified approach might recover a significant proportion of the power 
which is achieved when the true distribution of a is available. Nonetheless, as 
demonstrated by the Monte Carlo results for a symmetric distribution of a, 
there may be situations in which the advantages offered by the joint likelihood 
approach are marginal and the performance of the data stacking approach is 
deemed sufficient. 

While this study has focussed upon parameterisation of a uncertainties, 
the joint likelihood technique represents a highly flexible and extensible ana- 
lytical approach. Indeed, related analyses are able to parameterise arbitrary 
properties of the instrument response, the measured 7-ray signal, and even sus- 
pect ed source propert i es suc h as the spectral index and temporal variability 



(e.g. lAckermann et al.l . 120111) 



7. Conclusions 

Toy Monte Carlo simulations have been used to investigate the effect of non- 
negligible systematic uncertainties affecting the relative normalisation of the ON 
and OFF-source event sampling regions in atmospheric Cherenkov telescope ob- 
servations. As a specific example, the properties of two alternative strategies 
for combined source analysis have been examined. Situations have been iden- 
tified in which the traditional data stacking approach exhibits unexpected and 
undesirable statistical behaviour. Even when correctly calibrated null distri- 
butions for the combined significance are employed, the sensitivity of the data 
stacking approach appears markedly impaired if asymmetrically distributed a 
uncertainties are assumed. The data stacking technique is also yields unreli- 
able estimates for the 7-ray signal parameters, with derived confidence intervals 
deviate from their stated coverage by a margin which widens with increasing 
target multiplicity. 

An alternative to data stacking has been outlined which combines target- 
specific likelihood functions that explicitly account for non-uniform systematic 
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uncertainties. As an analytical approach, the joint likelihood technique exhibits 
many characteristics which surpass those of traditional data stacking. It offers 
a statistically robust prescription for the treatment of target-specific systematic 
uncertainties, and effectively addresses the shortcomings which are inherent in 
the data stacking framework. Indeed, the joint likelihood effectively eliminates 
pathological behaviour inherent in data stacking, with combined source signifi- 
cances conforming to a single, nominal null distribution. Moreover, it achieves 
equivalent or superior sensitivity to the data stacking technique and yields pa- 
rameter confidence intervals that exhibit minimal deviation from their stated 
coverage. 

Many of the insights provided by this study of combined source analyses are 
equally applicable to single-target analyses involving similarly abundant event 
statistics. Indeed, the synthesis of any extensive dataset often combines obser- 
vations which incorporate diverse systematic effects. This study has shown that 
the joint likelihood method provides a viable approach for the robust combi- 
nation of data with inhomogcneous systematic uncertainties, irrespective of the 
celestial target coordinates. Although it is a generally applicable technique, the 
joint likelihood analysis is likely to prove most useful in experimental situations 
involving limited event statistics and a restricted choice of OFF-sourcc sampling 
regions. 
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Figure 2: Monte Carlo distributions for S corresponding to the Joint Likelihood and Data 
Stacking approaches. Each histogram is representative of 5000 independent Monte Carlo 
datasets, which were generated assuming Nq = 0. Results are shown which correspond Model 
A {top row), Model B (middle row) and Model C {bottom row) for target multiplicities 
m £ [1, 10]. All Monte Carlo datasets were generated assuming iVopp = 100 and « = 0.1. 
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Figure 3: Left-hand column: Calibrated significance threshold values corresponding to the 
95th percentile of the Monte Carlo null distributions of S for the Joint Likelihood (blue) and 
Data Stacking (red) approaches. Right-hand column: Differences between the derived powers 
corresponding to the Joint Likelihood and Data Stacking approaches. The plotted results 
correspond to Model A {top row), Model B (middle row) and Model C (bottom row). Each 
bin is derived using analyses of 5000 independent Monte Carlo datasets which assume to true 
signals TVg G [0,10] and target multiplicities m G [1,10]. All Monte Carlo datasets were 
generated assuming JVoff = 100 and a = 0.1. 
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Figure 4: Statistical powers derived using St corresponding to the 95th percentile of the null 
distributions of S for the Joint Likelihood {left-hand column) and Data Stacking {right-hand 
column) approaches. Each bin is derived using analyses of 5000 independent Monte Carlo 
datasets. The plotted results correspond to Model A {top row), Model B {middle row) and 
Model C {bottom row) assuming true signals A^g £ [0, 10] and target multiplicities m S [1, 10]. 
The underlying datasets are as for Figure [3] 
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Figure 5: Joint Likelihood (JL) and Data Stacking (DS) coverage characteristics corresponding 
Model A {top row), Model B {middle row) and Model C {bottom row). Each datapoint is 
derived using analyses of 5000 independent Monte Carlo datasets and indicates the fraction 
of generated 95% confidence intervals on A^g which include the true signal value, Ng. The 
underlying datasets are as for Figure |3] 
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Figure 6: Residuals with respect to nominal coverage for 95% confidence intervals on Nq 
corresponding to the Joint Likelihood (JL) and Data Stacking (DS) approaches. Results are 
shown which correspond Model A (top row), Model B (middle row) and Model C (bottom 
row). The underlying datasets are as for Figure [3] 
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Figure 7: Calibrated significance threshold values corresponding to the 95th percentile of the 
Monte Carlo null distributions of S for the and Data Stacking approach. The illustrated 
results correspond to Gaussian (left) and bifurcated Gaussian (right) parameterisations for 
G{a\a), using target-specific uncertainties 1% < aa/a < 10% for the symmetric models, and 
1% < ctq p^/(5 < 10% with cTa^R = 2f7c,L for the asymmetric models. Each datapoint is 
derived using analyses of 5000 independent Monte Carlo datasets, which assume ensembles 
of m = 50 targets having true signals TVs = 0. All Monte Carlo datasets were generated 
assuming Nopp = 100 and o = 0.1. 
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